Under the auspices of the Institute of Computer Science at the University of Tartu, open-source language models will be trained to speak Estonian more fluently and better understand Estonian culture. In this way, we can preserve and protect the Estonian language in the face of the rapid development of artificial intelligence and create applications that Estonians can conveniently use.
Chatbots, text summarizers, content aggregators, question-answering systems and similar applications are built on large language models. For such applications to work well, the models need a good command of Estonian. Kairit Sirts, Associate Professor in Natural Language Processing at the University of Tartu, says that in conversations with artificial intelligence, Estonian often sounds artificial and clumsy. "Some open-source language models already speak Estonian to a certain extent, but our aim is to make the language models speak the way people actually speak. Instead of eloquence, Estonians tend to be straightforward and laconic. We can train the model to take into account the Estonian cultural context and also to get better at grammar," said Kairit Sirts.
Language models created by big tech companies are aimed at the masses, and we have no control over them. For example, OpenAI's ChatGPT cannot be used in areas that require confidentiality, such as state defence or healthcare. Estonian researchers are now continuing to train existing open-source language models on more Estonian texts so that, in the future, it will be possible to create secure, high-quality AI applications that understand the Estonian language and context.
According to Associate Professor Kairit Sirts, it is important to maintain and grow competence in large language models in our research community. "For technology companies, the Estonian language or cultural background is not something that matters. We have to look after these things ourselves. Thanks to the new project, we are also developing people's skills and knowledge so that we do not remain on the sidelines of technological developments," she said.
The project “Estonian language support in open-source large generative language models”, which was launched this year, brings together the best expertise in the field in Estonia. From the University of Tartu, Professor Mark Fišel and students are participating alongside Associate Professor Kairit Sirts. Associate Professor Tanel Alumäe from Tallinn University of Technology is contributing with his students, and at the Institute of the Estonian Language the work is led by Natural Language Processing Engineer Eleri Aedmaa.
The project is funded by the national programme "Estonian Language Technology 2018–2027". The language models will be trained on LUMI, the largest supercomputer in Northern Europe, located in Finland. The initial duration of the project is two years. The first results with models adapted to the Estonian language are expected by June 2025.